Improved Nearest Neighbor Methods For Text Classification
Authors
Abstract
We present new nearest neighbor methods for text classification and evaluate them against existing nearest neighbor methods as well as other well-known text classification algorithms. Inspired by the language modeling approach to information retrieval, we show improvements in k-nearest neighbor (kNN) classification by replacing the classical cosine similarity with a similarity measure based on KL divergence. We also present an extension of kNN to the semi-supervised case, which turns out to be equivalent to semi-supervised learning with harmonic functions. In both supervised and semi-supervised experiments, our algorithms surpass traditional nearest neighbor methods and produce competitive results when compared to state-of-the-art methods such as Support Vector Machines (SVM) and transductive SVM on the Reuters-21578 dataset, the 20 Newsgroups dataset, and the Reuters Corpus Volume I (RCV1) dataset. To our knowledge, this paper presents one of the most comprehensive evaluations of different machine learning algorithms on the entire RCV1 dataset.
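The idea of swapping cosine similarity for a KL-divergence-based measure can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so everything specific here is an assumption made for illustration: the function names (`unigram_model`, `kl_divergence`, `knn_classify`) are hypothetical, documents are modeled as additively smoothed unigram distributions, and neighbors are ranked by KL(test || train) with a simple majority vote.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over `vocab`.

    The smoothing scheme and alpha are assumptions; the paper may use a
    different language-model smoothing method.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def knn_classify(test_tokens, train_docs, k=3, alpha=0.1):
    """Return the majority label among the k training documents whose
    language models have the smallest KL divergence from the test document.

    `train_docs` is a list of (tokens, label) pairs.
    """
    # Build a shared vocabulary so every model has the same support.
    vocab = set(test_tokens)
    for tokens, _ in train_docs:
        vocab.update(tokens)
    vocab = sorted(vocab)

    p_test = unigram_model(test_tokens, vocab, alpha)
    scored = []
    for tokens, label in train_docs:
        q_train = unigram_model(tokens, vocab, alpha)
        scored.append((kl_divergence(p_test, q_train), label))

    # Smaller divergence means a closer neighbor.
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    docs = [
        ("wheat corn grain harvest".split(), "grain"),
        ("corn wheat export grain".split(), "grain"),
        ("oil crude barrel price".split(), "oil"),
        ("crude oil opec barrel".split(), "oil"),
    ]
    print(knn_classify("wheat grain price".split(), docs, k=3))
```

Smoothing matters in this sketch because KL divergence is undefined when the training model assigns zero probability to a word that occurs in the test document; additive smoothing is only one of several ways to avoid that.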
Related resources
An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
Improved Nearest Neighbor Methods For Text Classification With Language Modeling and Harmonic Functions
We present new nearest neighbor methods for text classification and an evaluation of these methods against the existing nearest neighbor methods as well as other well-known text classification algorithms. Inspired by the language modeling approach to information retrieval, we show improvements in k-nearest neighbor (kNN) classification by replacing the classical cosine similarity with a KL dive...
Integrating Background Knowledge into Nearest-Neighbor Text Classification
This paper describes two different approaches for incorporating background knowledge into nearest-neighbor text classification. Our first approach uses background text to assess the similarity between training and test documents rather than assessing their similarity directly. The second method redescribes examples using Latent Semantic Indexing on the background knowledge, assessing document similari...
Improved Nearest Neighbor Based Approach to Accurate Document Skew Estimation
The nearest-neighbor based document skew detection methods do not require the presence of a predominant text area, and are not subject to skew angle limitation. However, the accuracy of these methods is not perfect in general. In this paper, we present an improved nearest-neighbor based approach to perform accurate document skew estimation. Size restriction is introduced to the detection of nea...
Publication date: 2011